DB2 Recovery 101 - Baltimore/Washington DB2 Users Group

Download Report

Transcript DB2 Recovery 101 - Baltimore/Washington DB2 Users Group

DB2 Recovery 101 Overview
Bill Arledge
DB2 Data Management Strategist
Mainframe DB Recovery
7/17/2015
Availability is critical …
u
Lost revenue…
 Up
to $6 million/hour for e-businesses
2
- 3% of annual revenue for every 10
hours of database outage
u
Lost customers…
 Nearly
14 times your initial investment to
win customers back
u
Lost market share…
 Approximately
1/2 point of market share
for every 8 hours of outage
(It will take an estimated 3 years to win
customers back)
2
7/17/2015
©2007 BMC Software
Recovery Elements
FAILURE
ANALYSIS
CREATE
RECOVERY
JCL
RECOVERY
MANAGEMENT
APPLICATION OUTAGE
3
7/17/2015
©2007 BMC Software
EXECUTE
RECOVERY
FAST
UTILITIES
When Availability is Critical,
Recovery is Crucial!
Unplanned downtime is an unfortunate fact of life….


Up to 80% of all unplanned downtime is caused by software or human error*
Up to 70% of recovery is “think time”!
Investigate
20%
Analyze
30%
Diagnose
20%
Recover
30%
*Source: Gartner, “Aftermath: Disaster Recovery”, Vic Wheatman, September 21, 2001
4
7/17/2015
©2007 BMC Software
DB2 Recovery Resource Review
– Large lines = User data flow
• Updates to Tablespace are Logged, Copied
– Small lines = DB2 info flow
• Copies and Logs are registered
• Log range for Tablespace recovery tracked
– MVS ICF Catalog
• Watches over all
SYSLGRNX
ICF
Catalog
Full
Copy
TABLESPACE
or
Incremental
Copies
Archive
Logs
Active
Log
or
SYSCOPY
BSDS
5
7/17/2015
©2007 BMC Software
ICF Catalog
› All DB2 pagesets must be cataloged
– Cataloging updates three MVS system files
• VTOC - Volume Table of Contents
• VVDS - VSAM Volume Data Sets
• ICF - Integrated Catalog Facility
ICF
Catalog
CREATE TABLESPACE
TS000001 IN DP000001
USING VCAT DB2PVCAT
...
VTOC, VVDS
DB2PVCAT.DSNDBC.DP000001.TS000001.I0001.A001
DB2PVCAT.DSNDBD.DP000001.TS000001.I0001.A001
6
7/17/2015
©2007 BMC Software
Boot Strap Data Set (BSDS)
– Boot Strap Data Set (BSDS)
• Active and Archive Log Inventories
• Active Log is reused, Archive Log is serial up to 10000 (V8 and later)
– Current Active Log is DS01
– Next Archive Log will be A004
BSDS
DS03
DS02
DS01
Active
Logs
ACTIVE
RBA
LOG START END
DS01
DS02
DS03
A003
A002
A001
Archive
Logs
7
7/17/2015
©2007 BMC Software
40000 4FFFF Not Reusable
20000 2FFFF Reusable
30000 3FFFF Reusable
ARCHIVE
RBA
LOG START END
A001
A002
A003
STATUS
VOLSER
10000 1FFFF VOL=T00001
20000 2FFFF VOL=T00002
30000 3FFFF VOL=T00003
BSDS and Relationship to the LOG
Incomplete
UR Summary
Record
Open Pageset
Summary
Record
Database and Pageset
Exceptions
Summary Record
LOG
DB2
Checkpoint
BSDS
Active & Archive Log
Dataset Information
DDF Communication
Record
Checkpoint Queue
8
7/17/2015
©2007 BMC Software
Print Log Map (DSNJU004)
› DSNJU004 (Print Log Map) provides
– Active log data set information
– Archive log data set information
– System checkpoints
• Driven by
– LOGLOAD ZPARM
– Active log switch
9
7/17/2015
©2007 BMC Software
Local Time
GMT
DB2 Logs – A Shared Resource
› Primary usage to provide for restart and recovery of DB2
subsystems and objects
– Other uses including audit and data migration
› Log records
– Record updates for all spaces with before and/or after images
• Data and index pages
– Record DDL operations (updates to catalog pages)
– Capture checkpoint and restart information including
• transaction info
• Object exception information
– Are identified by a log point
• RBAs (Relative Byte Addresses) for non-data sharing
• LRSNs (Log Record Sequence Numbers) for data sharing
– Include the page (and row if applicable) being impacted
10
7/17/2015
©2007 BMC Software
DB2 Log Data Flow
INSERT
ONE ROW
– Log Buffers
• RBA (Relative Byte Address)
– Written to Active Log
– Full Active written to Archive
DB2 RECOVERY
LOG MANAGER
LOG BUFFERS
RBA 2000
BEGIN_UR
RECORD
OPEN
PAGESETS
UNDO / REDO UNDO / REDO UNDO / REDO
DATA
INDEX
INDEX
COMMIT
RECORDS
END_UR
RECORD
Checkpoint
RECORD
RBA 1000
Active
Log
11
7/17/2015
©2007 BMC Software
LOGSWITCH
Archive
Log
DB2 Log Data Breakdown
Other
15%
Data
20%
Commit
5%
UNDO
REDO
Checkpoint
10%
REDO
Index
UNDO
12
7/17/2015
©2007 BMC Software
50%
DB2 Log Archival Process
› Active log offload is triggered by several events, including:
–
–
–
–
active log data set is full
Starting DB2 and an active log data set is full
ARCHIVE LOG command
Two uncommon events also trigger the offload:
• An error occurring while writing to an active log data set
• Filling of the last un-archived active log data set
ARCHIVING THE ACTIVE LOG
Write to Active Log
Triggering Event
Offload Process
Write the Archive
Update the BSDS
13
7/17/2015
©2007 BMC Software
DSN1LOGP - Looking at Log Records
› DSN1LOGP is a standalone utility available with DB2
› IBM does not set out to document everything you see in a detail
report
› You can still get lots of information but not easily
› Most recovery experts would find a more sophisticated log tool
handy to
– Re-create SQL from the log
– report and filter on transaction and column data more effectively and
– handle compression and other issues
14
7/17/2015
©2007 BMC Software
DSN1LOGP – Usage Examples
› Who updated that table that's not supposed to be updated?
› Who DROPPED that DATABASE?
› Did BADUSER update anything?
› Are GRANTs and REVOKEs being done outside of our control?
› Who FREEd that PLAN or PACKAGE?
› Are SAVEPOINTs being executed on this subsystem?
› Can I find a common quiet point for two tablespaces for Recovery?
› Sample summary log record for a Unit of Recovery (UOR)
DSN1151I DSN1LPRT UR CONNID=TSO
CORRID=BADUSER
AUTHID=BADUSER
PLAN=DSNESPCS
START DATE=05.059 TIME=14:29:20 DISP=COMMITTED
INFO=COMPLETE
STARTRBA=024AC666C4C7 ENDRBA=024AC666C76B STARTLRSN=BCA426C98AED ENDLRSN=BCA426C98C3E
NID=* LUWID=USBMCN01.DEBALU.BCA4267B984C.0001
COORDINATOR=* PARTICIPANTS=*
DATA MODIFIED:
DATABASE=0600=KMMSEGDB PAGE SET=0002=KMMSEGS
15
7/17/2015
©2007 BMC Software
Identifying Relevant Log
› The SYSLGRNX directory table records log ranges containing
updates to a space (or partition)
– There are entries for each data sharing member updating and
• these entries give the location range on the logs (relative byte address-RBA) and
• the relative time range (log record sequence number--LRSN) to coordinate
with copies and other logs
– SYSLGRNGX output provided by REPORT RECOVERY utility
• Identifies assets required for recovery
16
7/17/2015
©2007 BMC Software
DB2 Directory Table
– SYSLGRNX
• Open update log ranges on the DB2 Log
• Provides for faster recovery
QUIESCE
COPY
Quiet Point
LOG
Open log ranges
SYSLGRNX
17
7/17/2015
DBID
0105
0105
0105
A
B
PSID
000F
000F
000F
C
D
Start Range
A
C
E
©2007 BMC Software
E F
End Range
B
D
F
Current
Spaces Absent in SYSLGRNX
› Some catalog and directory spaces don’t have SYSLGRNX
entries
– DSNDB01.SYSUTILX
– DSNDB01.DBD01
– DSNDB01.SYSLGRNX
– DSNDB06.SYSCOPY
– DSNDB06.SYSGROUP
– DSNDB01.SCT02
– DSNDB01.SPT01
18
7/17/2015
©2007 BMC Software
Log Compatibility
› Since log records reference pages and rows in spaces and
spaces are identified by internal IDs:
– certain activities make one series of log records incompatible with
others
– require a new copy or starting point
– LOAD REPLACE completely resets the data and REORG and
REBUILD change row and key entry locations
– Certain DROPs can be disastrous
19
7/17/2015
©2007 BMC Software
Log Roadblocks
update row,
page E2,
row 1
insert row,
page 1F2,
row 7
log apply
log apply
COPY
20
7/17/2015
REORG is executed
Row is now on page E2, row 1
©2007 BMC Software
COPY
Problem Categories
› An expert categorizes failures and plans and performs
accordingly. There are three common possibilities.
– A media failure destroys data or compromises it (Disk failure or controller
or cache failure occurs)
– Data becomes logically compromised by an incorrect job or transaction
– The data center is unusable (aka disaster)
21
7/17/2015
©2007 BMC Software
Media Failure
› A media failure destroys data or compromises it
– Identify volume contents and RECOVER or REBUILD for traditional
DASD
– Identify objects affected by storage component and RECOVER or
REBUILD bearing mind that some may not be affected because they
weren’t recently updated
22
7/17/2015
©2007 BMC Software
Pop Quiz - Index Recovery
Maybe Not
Recovery to current (for a media failure) would not
require it . Some objects are being recovered to overcome
media failure. Related objects should still be consistent if
they were unaffected by the media failure.
23
7/17/2015
©2007 BMC Software
Logical Data Corruption
› Data has become logically compromised by an incorrect job or
transaction
– An expert finds the cause of the problem.
– An expert knows the possible tools to use
•
•
•
•
24
7/17/2015
whether it is a set of RECOVER and REBUILD statements or
a special program or
a special log tool
or some combination of the above
©2007 BMC Software
Finding Corruption Point
Look for a place where everyone agrees data
wasn’t corrupted. Get as close to now as
possible!
POINT
A
What happened in between
Application Data
Is fine
25
7/17/2015
POINT
B
Application Data
Is Corrupted
©2007 BMC Software
Consistency Point
If RECOVER must be used, a point of consistency across affected
table spaces must be located and any good updates after that point
will be lost.
Online Trans
Batch Job
Application Data
Is fine
26
7/17/2015
Batch Job Rerun
Application Data
Now Corrupted
©2007 BMC Software
Online Trans
Looking for QUIESCE Points
• Official points are recorded in SYSCOPY generally as
a result of execution of the QUIESCE utility
• Useful queries for evaluating available quiesce (quiet)
points on the DB2 log
Identifies all Quiesce
Points after the
specified ICDATE
SELECT DBNAME, TSNAME, ICDATE, ICTIME, HEX(START_RBA)
FROM SYSIBM.SYSCOPY WHERE DBNAME IN ('LSBX', 'LSBQ')
AND ICTYPE = 'Q' AND ICDATE > ’020124' ORDER BY ICDATE, ICTIME;
SELECT HEX(MAX(START_RBA))
FROM SYSIBM.SYSCOPY WHERE DBNAME IN ('LSBX', 'LSBQ')
AND ICTYPE = 'Q';
Identifies the latest
Quiesce point for a set
Of objects
SELECT DBNAME, NAME FROM SYSIBM.SYSTABLESPACE WHERE
DBNAME IN ('LSBX', 'LSBQ')
AND (DBNAME, NAME) NOT IN (
Identifies related objects
SELECT DBNAME, TSNAME
With no entry at the latest
FROM SYSIBM.SYSCOPY WHERE
quiesce point
START_RBA = (
SELECT MAX(START_RBA) FROM SYSIBM.SYSCOPY
WHERE DBNAME IN ('LSBX', 'LSBQ') ) );
27
7/17/2015
©2007 BMC Software
Recovering from Logical Errors
Using SQL Processes
If the result of the batch job was undone with SQL then the online
transactions might be preserved and it would not be necessary to
find a point of consistency for RECOVER.
Online Trans
Online Trans
Online Trans
Online Trans
Online Trans
Online Trans
Batch Job Rerun
Application Data
Now Corrupted
28
7/17/2015
Online Trans
Online Trans
Online Trans
Batch Job Reversed
SQL INSERT
©2007 BMC Software
SQL DELETE
Caveats for a SQL approach
› Using SQL to correct logical errors has some possible pitfalls
• The transactions being preserved may have depended on the incorrect
data
transaction changes
addresses
29
7/17/2015
transaction changes
salesmen based on
addresses
©2007 BMC Software
Caveats for a SQL approach
› Using SQL to correct logical errors has some possible pitfalls
• The transactions that ran during or after the corruption may have also
updated the same column in some of the rows corrupted
many employees
401K deductions set
to zero
30
7/17/2015
Employee requests to
set new percentages for
401K processed
©2007 BMC Software
Caveats for a SQL approach
› Allowing access to application spaces while they
are corrupted may cause problems as in these
examples
• If a group of customers were accidentally deleted from your data
base, then allowing salesmen to continue placing orders might cause
them to recreate customer rows because they don’t see the rows.
When a insert is attempted for the customer to correct the delete, it
will likely receive -803 or cause an improper duplicate customer
record
• Customer shipping addresses corrupted might cause a label to be
printed (read only) and packages to be misdirected
31
7/17/2015
©2007 BMC Software
Image Copy
› An image copy is a sequential dataset
– Contains page images from the tablespace or indexspace
– Represents at least one data set of a space and at most a complete space
(all data sets or partitions)
› Image Copies
– Can be made while changes are taking place (SHRLEVEL CHANGE) or
– Can be made allowing only reads so they are consistent (SHRLEVEL
REFERENCE)
– Registered in SYSCOPY and accessible via SQL SELECT
– REPORT RECOVERY identifies copy required for recovery
• No guarantee that a copy in SYSCOPY is not deleted or not cataloged.
– Can be used to UNLOAD data
– Deleted by
• DROP DDL against the space
• Potentially by the MODIFY utility
32
7/17/2015
©2007 BMC Software
Image Copy Types
› Multiple, identical image copies (four) may be made. They are
identified as:
• Primary or Backup; and
• Local or Recovery site.
› Image copies may be made with only changed pages. These are
incremental image copies.
x nK pages
space
copy
33
7/17/2015
©2007 BMC Software
SHRLEVEL CHANGE Copies
pg
0
pg
1
pg
2
pg
pg
pg
... FFF0 FFF1 FFF2 contents
of copy
page FFF2
update
page 2
update
1
copy
begins
34
7/17/2015
2
5
3
page 2
copied
4
page
FFF2
copied
©2007 BMC Software
copy
ends
External Copies Unknown to DB2
› Data set dumps made by DFDSS or DSN1COPY or other
mechanisms
– Aren’t registered but may be used by
• Restoring known copies that are consistent because the space was
stopped or DB2 was cleanly stopped
• Restoring a complete set of system data
– ‘flash copied’ or ‘snapped’ between the SET LOG SUSPEND and SET LOG
RESUME commands or
– made while DB2 is down after it was taken done cleanly
– and then restarting DB2.
35
7/17/2015
©2007 BMC Software
SYSCOPY MINING
SYSCOPY contains a wealth of data!
36
WHO?
WHAT?
WHEN?
DBNAME
TSNAME
DSNUM
GROUP_MEMBER
DSNAME
JOBNAME
AUTHID
ICTYPE
ICTIME
STYPE
ICDATE
SHRLEVEL
TIMESTAMP
ICBACKUP
START_RBA
7/17/2015
PIT_RBA
OTYPE
©2007 BMC Software
SYSCOPY Example
– DB2 Catalog Table SYSIBM.SYSCOPY
• Backup and recovery point information
IC Type
Description
F
Full
I
Incremental
Q
Quiesce
X
REORG LOG(YES)
37
7/17/2015
©2007 BMC Software
SHRLEVEL Description
R
Reference
C
Change
Spaces not recorded in SYSCOPY
› Three catalog and directory spaces do not have entries in the
SYSCOPY table
– DSNDB01.DBD01
– DSNDB01.SYSUTILX
– DSNDB06.SYSCOPY
› Information on copies for these spaces resides in the DB2 log
38
7/17/2015
©2007 BMC Software
Pop Quiz - Avoid Copies?
PROBOBABLY NOT
Not if REORG or LOAD are used with LOG NO.
39
7/17/2015
©2007 BMC Software
DB2 Recovery Processing
MESSAGES
Image
Copy
RESTORE
SYSCOPY
SYSLGRNG
LOGAPPLY
BSDS
ACTIVE
LOG
40
7/17/2015
ARCHIVE
LOG
©2007 BMC Software
TABLE
SPACE
RECOVER Flavors
› RECOVER can use all the log records to the end of
the subsystem log(s) or
› RECOVER can be instructed to stop at a particular
log point
› RECOVER usually starts by restoring image copies
except in the rare cases where everything is on the
log and
› RECOVER has a LOGONLY feature that assumes the
space is restored outside its control
41
7/17/2015
©2007 BMC Software
Disaster Recovery
› Options from weekly dumps to offsite logging
– Dumps - Simple, cheap, maximum data loss
• Weekly dumps means several days data loss
– Offsite Logs - Complex, expensive, no data loss
• Applying log data to shadow increases expense
– Compromise - Periodic vaulting of Copies & Logs
• Daily or hourly log shipment will minimize data loss
› Good topic for a future presentation
Cost
Complexity
46
7/17/2015
Data Loss
Outage Time
©2007 BMC Software
Expert Summary
› Know the basics and don’t be caught by the myths
› Know the assets you are trying to protect
› Know what you have to protect them with
› Plan for each type of failure and practice if you can
47
7/17/2015
©2007 BMC Software