The DBA Disaster Diary
Real-world Oracle failures: How they were resolved, and how they could have been prevented

Jeremiah Wilton, OCM
Independent Oracle Professional, Seattle, Washington
www.speakeasy.net/~jwilton
[email protected]

Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.

High Availability
• Many companies are pursuing HA
• Prolonged projects
• Researching, designing and implementing:
  – RAC
  – Dataguard
  – Clusters
  – Remote DR sites
• Huge capital expenditures

Benefits of HA
[Chart: hours of planned and unplanned downtime per 4-week period around a big-iron HA rollout, comparing before, during and after the rollout]

For my average outage in the past 12 months, would RAC or Dataguard have helped me?

Hypothetical #1
• 2-node RAC cluster
• Array controller fails
• Replacement parts are 6 hours away
• No value from RAC in this outage
• Dataguard might have helped…

Hypothetical #2: Dataguard
• Same controller failure, with 2-site Dataguard
• Standby is missing the last few transactions, unless using expensive zero-data-loss mode
• Management: "Wait for the part, save the data."
• Dataguard provides no value in this outage

Hypothetical #3
• Same 2-site Dataguard
• Developer's backfill is missing a WHERE clause
• All customer names updated to "Fred Flintstone"
• Developer takes 30 minutes to realize the mistake and alert the DBA
• Too late; the standby has committed the changes
• Dataguard provides no value in this outage

Solutions for hypothetical #3
• Flashback Query (see the sketch at the end of this section)
  – Extension of the consistent-read (CR) mechanism
  – Requires system-managed undo (SMU)
  – Limited by the undo retention period
• Logminer
  – Generate undo SQL
  – Apply one rowid at a time

Hypothetical #4: Zero-data-loss Dataguard
• RAC
• Zero-data-loss Dataguard with a 3-day apply delay
• SMU for Flashback Query
• Logminer at the ready

Hypothetical #4 (continued)
• Belt
• Suspenders
• Duct tape
• Dry suit
• Kevlar helmet
• Welcome to the Disaster Diary…
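To make the Flashback Query option from "Solutions for hypothetical #3" concrete, here is a minimal sketch, assuming a 9.2-or-later database running with system-managed undo and sufficient undo retention. The customers table, its columns and the 30-minute window are hypothetical illustrations, not part of the original slides.

  -- Look at the pre-mistake data through the consistent-read (undo) mechanism
  select customer_id, name
  from   customers as of timestamp (systimestamp - interval '30' minute)
  where  customer_id = 12345;

  -- One possible repair pattern: pull each damaged row's old value back from its flashback image
  update customers c
  set    name = (select f.name
                 from   customers as of timestamp (systimestamp - interval '30' minute) f
                 where  f.customer_id = c.customer_id)
  where  c.name = 'Fred Flintstone';

This only works while the undo needed for the flashback image is still within the retention window; past that point, the Logminer undo-SQL route is the fallback.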
Chapter 1: Double Failure
Setting: Internet startup
Circa: June 1997
Staff: 1 novice DBA, 1 systems engineer
Hardware: DEC AlphaServer 4800, 8 processors
Storage: 2 DEC StorageWorks HSZ50 arrays, 100 GB of RAID 0+1
Operating system: Digital Unix 4.0B
Oracle version: 7.3.2.3
Oracle Support: Silver (US)
Backups: daily hot tablespace backup to DLT tape

Some company background
• Independent thinkers
• Intelligent
• Specific knowledge and certifications less important
• Critical thinking
• Work ethic
• New DBA hired after two years of developer management
• The developers had managed well:
  – Hot backups
  – v$sesstat tuning tools
  – v$session_event tools
  – Lock monitoring
  – Space management

Vigilant at investigating new features
• 7.3: MAXEXTENTS UNLIMITED
• Prevent senseless DML failures:

  select 'alter '||segment_type||' '||segment_name||
         ' storage (maxextents unlimited);'
  from   dba_segments
  where  segment_type in ('TABLE', 'INDEX');

• Seemed to work well…

Planned Outage
• During the DBA's third week
• Modify and add some tables (simple DDL)
• Planned for late Thursday PM / early Friday AM

DLT problems
• Thursday AM before the outage, the backup failed with a DLT drive error:

  /opt/app/oracle/product/7.3.2/local/bin/gnutar: Cannot write to /dev/nrmt0h: I/O error
  BUS free ERROR - os_std, os_type = 11, std_type = 10 (from uerf)

• The outage on Friday AM falls during the backup window
• The last good backup before the outage is Wednesday AM

15-Minute Outage (a SQL*Plus sketch of these steps follows the next two timeline slides)
1. Put out the "closed" sign on the website
2. Shut down the listeners
3. Shut down the instance (immediate)
4. Start up the instance in restricted mode
5. Make the schema changes
6. Start up the listeners
7. Restart the webserver software so it reconnects to Oracle
8. Make sure the site is working internally
9. Take down the "closed" sign and let the public back in

The Outage
[Timeline graphic: Friday 01:00 through Sunday 06:00]
• Friday 01:00 SE begins the outage steps; gets stuck reopening the database with ORA-00600 [4000].
• Friday 01:15 SE tries to figure out the problem:
  – Checks the host logs
  – The alert file
  – The trace file from the ORA-00600
  – Searches AltaVista and newsgroups for similar occurrences; no positive results
• Friday 01:45 SE is in the Silver Support severity-1 call queue.

The Outage
• Friday 02:00 SE gets Oracle Support in the UK. The analyst opens a TAR.
  – Requests the alert and trace files via email
  – Will call back when they are received
• Friday 02:45 Oracle Support has not called back.
  – SE calls back in on the TAR
  – The analyst says she did not receive any email: "Sometimes their email systems can be slow."
  – Around this time, SE receives his email bounced by Oracle's mail server for being too large
  – Oracle and SE agree on FTP; SE uploads the files
  – The support analyst will examine them and call back shortly
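As promised above, a minimal SQL*Plus sketch of the database-side steps of the 15-minute runbook. The listener and webserver steps happen outside the database, the table and column in the DDL are hypothetical placeholders, and on 7.3.2 the privileged connection would have been "connect internal" in svrmgrl rather than "connect / as sysdba".

  connect / as sysdba                        -- privileged connection (connect internal on 7.3)
  shutdown immediate                         -- step 3: stop the instance cleanly
  startup restrict                           -- step 4: reopen so that only privileged sessions can connect
  -- step 5: the planned simple DDL (hypothetical example)
  alter table orders add (gift_message varchar2(200));
  alter system disable restricted session;   -- let ordinary application sessions back in
  -- steps 2 and 6 run at the OS prompt: lsnrctl stop / lsnrctl start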
DBA & SE call Oracle Support again since no callback yet Support says “corruption of some sort”; restore from backup and roll forward Friday 05:00 DBA & SE begin restore • Friday 06:15 Restore completes – DBA & SE begin roll forward through 48 hours of logs – Unaware of parallel recovery feature; recovering serially • Friday 06:45 – Recovery time projected to take nine hours Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author. The Outage Friday 01:00 06:00 12:00 18:00 Saturday 00:00 06:00 12:00 18:00 Sunday 00:00 06:00 • Friday 12:00 A “friend” of the company knows Ellison; calls him; – TAR transferred to bug diagnostics and escalation (BDE) – Local Oracle office offers to send personnel on site • Friday 15:30 Roll forward completes. A cold backup is taken before attempting to open the database. • Friday 16:00 • Recovered DB fails on open with ORA-00600 [4000] again – Restore/recover had no effect Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author. The Outage Friday 01:00 06:00 12:00 18:00 Saturday 00:00 06:00 12:00 18:00 Sunday 00:00 06:00 • Friday 16:45 16 hours into the outage – BDE divulges a known bug: [BUG:434596] prevents a database from being opened if BOOTSTRAP$ has its MAXEXTENTS value modified. Although it is unlikely anyone would intentionally change this, a number of users have run scripts to change the storage on ALL tables and in doing so have encountered this problem. Fixed in 7.3.4. – Restore/Recover was a waste of time – Original Support recommendation lacked specific understanding of problem Roll forward did nothing to fix several-week-old BOOTSTRAP$ extent modification Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author. Remain calm, think clearly Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author. Macro timeline • This is a “sleeper” – Data-related problem introduced many backup cycles ago – Sits on disk until database restart – Too much transpired since last “good” backup – No real recovery strategy available to end user Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author. The Outage Friday 01:00 06:00 12:00 18:00 Saturday 00:00 06:00 12:00 18:00 Sunday 00:00 06:00 • Friday 17:30 BDE requests dial-in access – “Work” copy of the DB placed new DEC Storageworks HSZ50 array for BDE – Original broken DB preserved in place Local Oracle office says on-site support analyst on the way • Friday 18:00 Oracle Support BDE dialed in and working • Friday 19:00 On-site Oracle Support arrives Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author. The Outage Friday 01:00 06:00 • 12:00 18:00 Saturday 00:00 06:00 12:00 18:00 Sunday 00:00 06:00 Friday 19:53 Media starts to take notice; Reuters report leads to articles in: – Wall Street Journal – TV networks – Online news sites Web site suffers outage LOS ANGELES, June 6 (Reuter) - The World Wide Web site of [the company] went down Friday, preventing Internet customers from ordering … from the online retailer. According to a notice posted on the company's Web page, work was being done on the system and customers should try back later. A spokesman for [the company] was not immediately available to comment on the reason for the outage or how long it would last. 
The Outage
• Friday 22:15 The DB successfully opens!
  – BDE repaired the DB on the new storage array
  – Now running on the new array
• Friday 22:30 Sanity testing begins
• Friday 23:00 The DB host crashes with a Unix kernel fault!
  – 20 minutes to reboot
• Friday 23:30 The site is back up and available on the Internet.
  – Total outage time: 22 hours and 30 minutes
  – Reduced YTD availability from 99.986% to 99.729%
• Saturday 00:00 A hot backup of the repaired DB is started
• Saturday 01:00 Everyone goes home

The Outage
• Saturday 06:15 DBA & SE are paged with corruption errors in the alert file (see the block-lookup sketch below):

  Corrupt block dba: 0x4c0530a6 file=19. blocknum=340134. found during buffer read
  on disk type:6. ver:1. dba: 0x4c0530a6 inc:0x000004e0 seq:0x0000026c incseq:0x00110000
  Reread of block=4c0530a6 file=19. blocknum=340134. found same corrupted data

  – The errors are coming fast, but the website is still up
  – Small number of read failures
• Saturday 06:45 SE is back on the phone with Oracle; DBA & CTO are on their way back
• Saturday 08:00 DBA arrives and joins the call with Oracle.
  – BDE is engaged
  – Everyone suspects an Oracle problem related to the block-munching by BDE
  – Non-website internal tools are disabled
  – BDE looks at block dumps and logs in to examine the problem

The Outage
• Saturday 12:00 BDE says the new RAID array is randomly and steadily corrupting blocks. The DBA is horrified.
  – SE makes a point: although blocks on disk are getting corrupted, the redologs are on the good array
  – The redologs are not getting corrupted
  – No corrupted data is being read and rewritten
  – DBA & SE conclude that a good version of each successful transaction is preserved in the redologs
  – It is the middle of the day and the peak website period
  – DBA & SE decide to keep the database up
  – It will run in a hobbled state until a period of low activity
  – Then restore the most recent backup to the working array and roll forward

The Outage
• Saturday 16:30 DBA & SE copy the cold backup from before the first BDE patch to the good array.
  – The most recent hot backup was taken from the bad RAID array
  – The previous backup was taken after the initial restore and recover, but prior to the patching
  – BDE will have to log in and re-patch this copy
• Saturday 21:00 Restore and recover run in parallel with the running DB.
  – DBA & SE start applying Saturday's logs to the restored database
  – The corrupting copy is still running
  – Parallel recovery this time
• Sunday 03:00 Recovery catches up to the corrupted open production DB.
  – BDE logs in and reapplies the block patches for the BOOTSTRAP$ bug
  – The recovered copy is still not open; the corrupted copy is still running
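A quick way to triage corrupt-block messages like the ones above is to map the reported file and block numbers to a segment, so you know what the read failures are actually hitting. A minimal sketch using the file=19, blocknum=340134 values from the alert-file excerpt; whatever segment it returns is of course specific to that database.

  -- Which segment owns the corrupt block reported in the alert file?
  select owner, segment_name, segment_type
  from   dba_extents
  where  file_id = 19
  and    340134 between block_id and block_id + blocks - 1;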
The Outage
• Sunday 04:00 The site and the running corrupted DB are taken down.
  – DBA & SE take a cold backup of the recovered and repaired database
  – The last few logs generated by the corrupted system are applied
• Sunday 06:00 The repaired DB is opened; testing begins
• Sunday 08:50 Testing complete; the site is back up
• Sunday 09:00 The CEO arrives and takes pictures with a disposable camera
• Sunday 10:00 The VP of Technology takes the IT staff out for breakfast, but nobody is hungry

Causes of outage, avoidance strategies
• BOOTSTRAP$ bug "sleeper"
  – Watch for bugs (no good way to do this)
  – Doing things to dictionary tables is not documented
  – Reasonably, this was an unavoidable problem
  – It could have been avoided by not trying to be too smart
• Array corruption (never verified; the controller was replaced)
  – Write programs to exercise and verify new arrays

Sleepers
• Introducing a latent problem long before it manifests means recovery is not an option
• Changes to many dictionary tables are only read at initialization
• Extremely insidious; no good recovery path other than BBED

Time Spent
[Chart: sum of minutes by activity]
Top time sinks for this outage:
• 34% Applying redologs
• 23% Waiting for Oracle Support
• 12% Troubleshooting
• 10% Testing
• 8% Working with Oracle Support
• 5% Restoring from tape
• 5% Taking cold backups
• 2% Kernel panic/reboot
• 1% Initial outage

Redolog apply time – 34%
• Lack of understanding of the problem
• Lack of a more recent backup
• Ignorance of parallel recovery (see the recovery-tuning sketch after the next two slides)
  – Learn about new features
  – Practice a variety of recoveries frequently
  – Tune recovery: wait events
  – Become an expert in recovery

Waiting for Oracle Support – 23%
• Managing support is a science
• Create your own TAR to avoid errors
  – MetaLink iTAR
  – Dial in to the TAR entry system
• Upload files to the FTP site before calling in
• Call in severity 1 and judge the analyst's skill
• Escalate immediately if necessary
• Find out when the analyst's shift ends
• Watch the analyst's annotations in the TAR and add your own
• Make your own judgments on the validity of a proposed solution

Troubleshooting – 12%
• Too much time was spent fiddling
  – Establish rules and time limits
  – When in doubt, start the TAR while troubleshooting
  – Escalate to other staff quickly
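A minimal sketch of the parallel-recovery and recovery-tuning points from the "Redolog apply time" slide. The exact RECOVER syntax and the default degree of recovery parallelism vary by Oracle release, so treat this as illustrative rather than definitive.

  -- init.ora: allow roll-forward to use multiple recovery slaves instead of a single serial process
  -- recovery_parallelism = 8

  -- Some releases also accept a parallel clause on the recovery command itself, for example:
  -- recover database parallel (degree 8);

  -- While a long roll-forward runs, check where the time is actually going
  select event, total_waits, time_waited
  from   v$system_event
  order  by time_waited desc;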
Lessons
• What technology would have helped?
  – RAC?
  – Dataguard?
  – Backups?
• The hypothetical belt, suspenders, helmet, etc.?
• Zero-data-loss Dataguard?

Lessons
• The most valuable availability assets:
  – A solid understanding of internals
  – Support-evasion countermeasures
  – An excellent support contract and personal contacts
• Attend conferences and seek to meet Oracle development folks
• Never do anything to SYS-owned objects without complete understanding
• Test new equipment before placing it in production
• Troll MetaLink, mailing lists and newsgroups for reports of trouble with new features

Real Failures
• Real failures never resemble the examples in the RAC and Dataguard marketing literature
• Real failures are weird and unpredictable, and they require fast critical thinking
• Invest in equipment conservatively and strategically
• Invest in education liberally

Future Disaster Diary Chapters
• Share your disasters with me
  – Anonymity
  – Acknowledgements
  – Preferably Oracle-related

• Half-day Crisis Management and Disaster Recovery seminars
• Week-long Basic and Advanced Oracle Administration classes
• Remote Emergency DBA services

www.speakeasy.net/~jwilton
[email protected]

Q&A

Copyright © 2004 by Jeremiah Wilton. Reproduction prohibited without the permission of the author.