
Log Manager
Jim Gray
Microsoft, Gray @ Microsoft.com
Andreas Reuter
International University, [email protected]
        Mon        Tue           Wed          Thur             Fri
9:00    Overview   TP mons       Log          Files & Buffers  B-tree
11:00   Faults     Lock Theory   ResMgr       COM+             Access Paths
1:30    Tolerance  Lock Techniq  CICS & Inet  Corba            Groupware
3:30    T Models   Queues        Adv TM       Replication      Benchmark
7:00    Party      Workflow      Cyberbrick   Party
10a: 1
Gray & Reuter Log
Log Concept
• Log is a history of all changes to the state.
• Log + old state gives new state.
• Log + new state gives old state (not in this picture).
• Log is a sequential file.
• Complete log is the complete history.
• Current state is just a "cache" of the log records.
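The two "log + state" rules above can be sketched in C. This is a minimal illustration, not the slides' design: it assumes the state is a single integer and that each (hypothetical) change_record keeps both the old and the new value, so the same log can be applied in either direction.

```c
#include <stddef.h>

/* Hypothetical log record: one change to a single integer-valued state. */
typedef struct {
    int old_value;   /* value before the change (used by undo) */
    int new_value;   /* value after the change  (used by redo) */
} change_record;

/* Log + old state gives new state: apply the records oldest first. */
int redo_all(int state, const change_record *records, size_t n)
{
    for (size_t i = 0; i < n; i++)
        state = records[i].new_value;
    return state;
}

/* Log + new state gives old state: apply the records newest first. */
int undo_all(int state, const change_record *records, size_t n)
{
    for (size_t i = n; i > 0; i--)
        state = records[i - 1].old_value;
    return state;
}
```

With the records {100,150}, {150,120}, {120,200}, redo_all takes state 100 to 200 and undo_all takes 200 back to 100.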
[Figure: master-file batch processing. Each night's batch run combines the previous master with the day's transactions to produce the next master (Sunday Master + Monday Transactions -> Monday Master, and likewise for Tuesday and Wednesday); an Archive is also shown.]
10a: 2
How Log is Used
• Recovery from faults:
    A redundant copy of the state and transitions.
• Security audits:
    Who did what to whom.
    Often too low-level for this.
• Performance Monitor & Accounting:
    But only records changes (not reads).
• ISSUES: Who should be allowed to read the log?
    It is a security hole.
    Must authorize access on a per-record basis.
10a: 3
The Log Manager in the Scheme of Things
[Figure: system components: SQL & Other, Transaction Manager, Resource Managers, Lock Manager, Log Manager, Archive Manager, File Manager, Buffer Manager, Media Manager, Operating System, File System.]
The interesting thing is the cycle:
recovering the log needs the archive, and recovering the archive needs the log.
Break the cycle with a bootstrap file.
10a: 4
Log Is a Sequential File.
Encapsulation of the log: it is a shared resource.
Startup: Log manager holds startup info for all others.
Careful writes: Log manager provides a sequential file that is
• High performance
• Very reliable
• Semi-infinite
• Archived
Some RMs keep private logs anyway (notably PORTABLE DB systems).
Then the user or the system has to manage multiple logs.
10a: 5
The Log Table
Log table is a sequential set (relation).
Log Records have standard part and then a log body.
Often want to query the table via one attribute or another:
RMID, TRID, timestamp, ...
create domain LSN unsigned integer(64);   -- log sequence number (file #, rba)
create domain RMID unsigned integer;      -- resource manager identifier
create domain TRID char(12);              -- transaction identifier
create table log_table (
    lsn              LSN,        -- the record's log sequence number
    prev_lsn         LSN,        -- the lsn of the previous record in log
    timestamp        TIMESTAMP,  -- time log record was created
    resource_manager RMID,       -- resource mgr that wrote this record
    trid             TRID,       -- id of transaction that wrote this record
    tran_prev_lsn    LSN,        -- prev log record of this transaction (or 0)
    body             varchar,    -- log data: rm understands it
    primary key (lsn),           -- lsn is primary key
    foreign key (prev_lsn)       -- previous log record in this table
        references log_table(lsn),
    foreign key (tran_prev_lsn)  -- transaction's prev log rec also in table
        references log_table(lsn)
) entry sequenced;               -- inserts go at end of file
10a: 6
Log is complete history
[Figure: the Log Anchor (max_lsn, trid, min_lsn, ...) and the Log Table, whose records carry lsn, prev_lsn, resource_mgr, trid, tran_prev_lsn and body; also shown are the A files, the B files and the Archive.]
Log anchor points at the chain of each transaction.
May maintain other chains.
Log records map to a sequence of N-plexed files.
Old files are archived.
Eventually, archive files are discarded (weeks, months, never).
10a: 7
The Log LSN
Each log record has a log sequence number (LSN).
The LSN plays a key role in many algorithms.
Key property, MONOTONICITY:
If action A happened after action B, then LSN(A) > LSN(B).
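A minimal sketch of how monotonicity can be tested in C. It assumes, following the "(file #, rba)" comment in the log table definition, that an LSN is a pair of a file number and a relative byte address; the struct layout and the name lsn_compare are illustrative and not taken from the slides.

```c
/* Assumed LSN layout: log file number, then relative byte address (rba). */
typedef struct {
    unsigned long file;   /* which log file the record is in */
    unsigned long rba;    /* byte offset of the record in that file */
} LSN;

/* Returns <0, 0 or >0 when a is older than, equal to, or newer than b.
   Files are ordered first; the rba only breaks ties within one file. */
int lsn_compare(LSN a, LSN b)
{
    if (a.file != b.file)
        return (a.file < b.file) ? -1 : 1;
    if (a.rba != b.rba)
        return (a.rba < b.rba) ? -1 : 1;
    return 0;
}
```

Under this ordering a record at the start of file 3 is newer than any record in file 2, whatever its rba.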
10a: 8
Reading The Log
long log_read_lsn( LSN lsn,                    /* lsn of record to be read            */
                   log_record_header * header, /* header fields of record to be read  */
                   long offset,                /* offset into body to start read      */
                   pointer buffer,             /* buffer to receive log data          */
                   long n);                    /* length of buffer                    */
LSN log_max_lsn(void);   /* returns the current maximum lsn of the log table. */
Read with C (see next slide) or SQL:
long sql_count( RMID rmid)                    /* count log records written by this rmid  */
{   long rec_count;                           /* count of records                        */
    exec sql SELECT count (*)                 /* ask sql to scan log counting records    */
             INTO :rec_count                  /* written by the calling resource mgr and */
             FROM log_table                   /* place count in the rec_count            */
             WHERE resource_manager = :rmid;
    return rec_count;                         /* return the answer.                      */
}
10a: 9
Reading the Log: SQL is easier than C
long c_count( RMID rmid)                 /* count log records written by this rmid   */
{   log_record_header header;            /* structure to receive log record header   */
    LSN   lsn;                           /* log sequence number of next log rec      */
    char  buffer[1];                     /* null buffer to receive log record body   */
    long  rec_count = 0;                 /* count of records                         */
    long  n = 1;                         /* size of log body returned                */
    if (!log_open(READ)) panic();        /* open the log (authorization check)       */
    lsn = log_max_lsn( );                /* get most recent lsn                      */
    while (lsn != NullLSN)               /* scan backward through the log            */
    {   n = log_read_lsn( lsn,           /* lsn of record to be read                 */
                          &header,       /* log record header fields                 */
                          0L, buffer, 1L );  /* log rec body ignored.                */
        if (header.rmid == rmid)         /* if record written by this RMID then      */
            rec_count = rec_count + 1;   /* increment count                          */
        lsn = header.prev_lsn;           /* go to previous LSN.                      */
    }                                    /* loop over LSNs                           */
    logtable_close( );                   /* close log table                          */
    return rec_count;                    /* return the answer.                       */
}
10a: 10
Writing The Log
Add a log record; the log manager fills in the header.
    LSN log_insert( char * buffer, long n);   /* log body is buffer[0..n-1] */
Force the log up to a certain LSN to persistent storage:
    LSN log_flush( LSN lsn, Boolean lazy);
    (lazy waits for a batch write or a timeout == boxcar)
Note: many real interfaces allow some of:
    empty buffer: lets the RM fill it in (avoids data copies)
    incremental copy: build the "buffer" in steps.
    gather: take log data from many buffers.
Few offer SQL access to the log.
10a: 11
Summary Of Log Structure And Verbs
[Figure: log structure. Labels: A file and B file in durable storage; Log Table; log pages in buffer pool; log page header; record header and body; end of durable log; pages written in next write; current end of log; empty page in buffer pool.]
Operations: Open/Close, Read(LSN), Insert(body), Flush(LSN), SQL read operations.
10a: 12
Log Anchor Logging and Locking
typedef struct {
    filename   tablename;        /* name of log table                          */
    struct     log_files;        /* A & B file prefix names & active file #    */
    xsemaphore lock;             /* semaphore regulates log write              */
    LSN        prev_lsn;         /* LSN of most recent write                   */
    LSN        lsn;              /* LSN of next record                         */
    LSN        durable_lsn;      /* max lsn in durable storage                 */
    LSN        TM_anchor_lsn;    /* lsn of trans mgr's last ckpt               */
    struct {                     /* array of open log parts                    */
        long   partno;           /* partition number                           */
        int    os_fnum;          /* operating system file #                    */
    } part [MAXOPENS];
} log_anchor ;
Log records are never updated: only inserted and read.
So no locks are needed on the log.
A semaphore (or something) is needed on the "end" of the log
to manage space/growth/LSN for inserts.
10a: 13
Making Optimistic Log Reads Work
Log is duplexed.
Log manager reads only one copy of the page.
What if the "other" copy has more data?
Trick:
    Read BOTH copies of the FIRST and LAST page in the log.
    Other pages have a "full" flag and a timestamp.
    IF not full or timestamp < prev_timestamp THEN
        read the other copy and take the highest timestamp.
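The IF rule above can be sketched as follows. The log_page struct and the function name optimistic_read are assumptions made for illustration; a real page would also carry its data, and "reading the other copy" would be a second disc read rather than a pointer dereference.

```c
/* Hypothetical in-memory view of one copy of a duplexed log page. */
typedef struct {
    int  full;        /* nonzero once the page has been completely filled */
    long timestamp;   /* time of the last write to this copy */
} log_page;

/* Optimistic read of a middle page: trust copy_a if it is full and not
   older than the page before it; otherwise look at copy_b as well and
   return whichever copy carries the highest timestamp. */
const log_page *optimistic_read(const log_page *copy_a,
                                const log_page *copy_b,
                                long prev_timestamp)
{
    if (copy_a->full && copy_a->timestamp >= prev_timestamp)
        return copy_a;                         /* common case: one read */
    return (copy_b->timestamp > copy_a->timestamp) ? copy_b : copy_a;
}
```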
Torn log pages
    A log page consists of disk sectors (512B).
    A write may write only some of the sectors.
    How to detect missing fragments?
    1. Checksum?
    2. Byte stuffing: stuff a "parity" byte on each page
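A sketch of option 1, the checksum. Everything here is an assumption for illustration: the page is taken to be eight 512-byte sectors, the checksum sits in the first four bytes, and an Adler-32-style sum stands in for whatever checksum a real log manager would use.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE      512
#define SECTORS_PER_PAGE 8
#define PAGE_SIZE        (SECTOR_SIZE * SECTORS_PER_PAGE)
#define DATA_SIZE        (PAGE_SIZE - sizeof(uint32_t))

/* Hypothetical log page: a checksum followed by the log data. */
typedef struct {
    uint32_t checksum;          /* covers every byte of data[] */
    uint8_t  data[DATA_SIZE];
} log_page;

/* Adler-32-style checksum over the data area of the page. */
uint32_t page_checksum(const log_page *p)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < DATA_SIZE; i++) {
        a = (a + p->data[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/* Call just before the page is written to disc. */
void page_seal(log_page *p)
{
    p->checksum = page_checksum(p);
}

/* Call after reading: zero means some sectors did not match the header. */
int page_is_intact(const log_page *p)
{
    return p->checksum == page_checksum(p);
}
```

If a write reaches only some sectors, the stored checksum (from the new page) no longer matches the mix of old and new data, so page_is_intact returns 0 and the reader can fall back to the other copy of the page.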
10a: 14
Log Insert
Log semaphore covers:
    Incrementing the LSN
    Finding the log end
    Filling in the page(s)
    Allocating space on a page, perhaps allocating new pages.
LSN log_insert( char * buffer, long n)   /* insert a log record with body buffer[0..n-1] */
{   LSN  lsn;                            /* lsn of the record being inserted             */
    long rec_len;                        /* length of the record (header + body)         */
    /* Acquire the log lock (an exclusive semaphore on the log)                          */
    Xsem_get(&log_anchor.lock);          /* lock the log end in exclusive mode           */
    lsn = log_anchor.lsn;                /* make a copy of the record's lsn.             */
    /* find page and allocate space in it.                                               */
    /* fill in log record header & body; this sets rec_len                               */
    /* update the anchors                                                                */
    log_anchor.prev_lsn = lsn;           /* log anchor lsn points past this record       */
    log_anchor.lsn.rba = log_anchor.lsn.rba + rec_len;
    Xsem_give(&log_anchor.lock);         /* unlock the log end                           */
    return lsn;                          /* return lsn of record just inserted           */
}
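The elided steps of log_insert can be filled in with a small runnable sketch. It is not the slides' implementation: it assumes the whole log is one in-memory array, that an LSN is simply a byte offset into it, and it uses a C11 atomic_flag spin lock as a stand-in for the exclusive semaphore. The names log_space, anchor_lsn, anchor_prev_lsn and record_header are hypothetical.

```c
#include <stdatomic.h>
#include <string.h>

#define LOG_BYTES 4096
#define NULL_LSN  (-1L)

typedef long LSN;                 /* assumption: LSN = byte offset in log_space */

typedef struct {
    LSN  prev_lsn;                /* lsn of the previous record in the log */
    long length;                  /* length of the body that follows */
} record_header;

static atomic_flag log_lock = ATOMIC_FLAG_INIT;   /* guards the log end */
static LSN  anchor_prev_lsn = NULL_LSN;           /* lsn of most recent record */
static LSN  anchor_lsn      = 0;                  /* lsn of the next record */
static char log_space[LOG_BYTES];                 /* the "log file" */

/* Insert a record with body buffer[0..n-1]; returns its lsn,
   or NULL_LSN if n is negative or the record does not fit. */
LSN log_insert(const char *buffer, long n)
{
    record_header header;
    long rec_len = (long)sizeof header + n;
    LSN  lsn;

    while (atomic_flag_test_and_set(&log_lock))
        ;                                         /* spin: lock the log end */
    if (n < 0 || anchor_lsn + rec_len > LOG_BYTES) {
        atomic_flag_clear(&log_lock);
        return NULL_LSN;
    }
    lsn = anchor_lsn;                             /* this record's lsn */
    header.prev_lsn = anchor_prev_lsn;            /* fill in the header */
    header.length   = n;
    memcpy(log_space + lsn, &header, sizeof header);
    memcpy(log_space + lsn + sizeof header, buffer, (size_t)n);
    anchor_prev_lsn = lsn;                        /* update the anchors */
    anchor_lsn      = lsn + rec_len;              /* points past this record */
    atomic_flag_clear(&log_lock);                 /* unlock the log end */
    return lsn;
}
```

Only memory copies happen while the lock is held; no IO is done under it.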
10a: 15
Log Write Demon
Log semaphore can be a hotspot, so: no IO under the semaphore.
Allocation (OS requests) and archiving are done in advance.
Flush to persistent storage (disc) is done asynchronously.
Demons are driven by timers and by events (requests).
Demons need not touch the end-of-log semaphore.
[Figure: application programs and resource managers call the log code; one log daemon flushes (carefully writes) log pages as needed, another log daemon allocates new log files as needed; log data is in shared memory and on disc.]
10a: 16
Careful Writes
If partial pages may be written, then a
subsequent write may invalidate a previous write.
Standard technique:
    Serial writes: write one page, then write the second page.
    Problem: ~1/2 disc bandwidth, 2x delay.
Ping-Pong technique:
    Never overwrite a good page: Ping-Pong between pages I and I+1.
    When complete, assure that page I has the final data.
    Never worse than serial write, generally 2x better.
[Figure: writes of a new log page to disc pages i and i+1, comparing parallel writes with Ping-Pong writes.]
Also note the careful techniques for optimistic reads and torn pages.
10a: 17
Group Commit (Boxcaring)
Batch processing of log writes.
If the log receives 1,000 force requests per second,
why not execute just 50 writes per second (20 requests per write)?
    Response time will be the same (~20ms).
    IOs will be 20x fewer.
    CPU will be ~10x smaller (10x fewer dispatches, 20x fewer OS IOs).
Without it, systems are limited to about
    50 tps without ping-pong
    100 tps with ping-pong.
With it, systems are limited by disc bandwidth (>>10 ktps).
Group commit threshold can be set automatically.
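The arithmetic above can be checked with a small sketch. It assumes a simple boxcar policy, which the slides do not spell out: the first force request to arrive starts a timer, and every request arriving before the timer expires shares one log write. The function name boxcar_writes and its parameters are illustrative.

```c
#include <stddef.h>

/* Count the log writes needed when force requests are boxcarred.
   arrival_ms[] holds request arrival times in ascending order;
   window_ms is how long the first request of a batch waits. */
size_t boxcar_writes(const long *arrival_ms, size_t n, long window_ms)
{
    size_t writes = 0;
    size_t i = 0;
    while (i < n) {
        long deadline = arrival_ms[i] + window_ms;  /* first waiter starts timer */
        while (i < n && arrival_ms[i] < deadline)
            i++;                                    /* these share one write */
        writes++;
    }
    return writes;
}
```

With 1,000 requests arriving one per millisecond and a 20 ms window, this gives 50 writes instead of 1,000, the 20x reduction in IOs quoted above.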
10a: 18
WADS- Giving the Log Disc Zero Latency
Log disc is dedicated, so it has only rotational latency.
Reserve some cylinders on the disc as scratch.
For each write:
    Write at the current position on the next track (zero latency).
When a full track (or two) of log data has accumulated:
    consolidate the write in RAM
    do a single LARGE write (100KB = 1 rotation) to the log.
    The cost of this is seek + rotation ~ 20ms.
This reserved area is called the Write Ahead Data Set (WADS).
At restart:
    read the cylinders
    gather recent log data
    rewrite the end of the log.
RAID write cache makes this obsolete (if it works).
10a: 19
Log: Normal Use
Transaction UNDO During Normal Operation
Transaction log anchor: needed during normal operation
Points to most recent log rec of that transaction.
Follow the transaction prev_lsn chain.
EASY!
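A sketch of that chain walk, under stated assumptions: the log is an in-memory array indexed by LSN, LSN 0 means "no record" (matching the "(or 0)" comment on tran_prev_lsn), and each record stores the old value of one integer so that undo is a plain assignment. The log_record layout and the name undo_transaction are hypothetical.

```c
#include <stddef.h>

typedef unsigned long LSN;        /* assumption: index into the log array, 0 = none */

/* Hypothetical undo record for a change to one integer. */
typedef struct {
    LSN  tran_prev_lsn;           /* previous record of the same transaction (or 0) */
    int *target;                  /* where the change was applied */
    int  old_value;               /* value to restore on undo */
} log_record;

/* Undo one transaction: start at the record its anchor points to and
   follow the tran_prev_lsn chain back to the transaction's first record.
   Returns the number of records undone. */
size_t undo_transaction(const log_record *log_table, LSN last_lsn)
{
    size_t undone = 0;
    for (LSN lsn = last_lsn; lsn != 0; lsn = log_table[lsn].tran_prev_lsn) {
        *log_table[lsn].target = log_table[lsn].old_value;
        undone++;
    }
    return undone;
}
```

Records of other transactions that are interleaved in the log are never visited, because the chain links only the records of the transaction being undone.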
10a: 20
The Log Anchor: Where It All Starts
REDO/UNDO at System / RM Restart.
Need to bootstrap the most recent log state.
Log manager is the first to restart.
    It helps the Transaction Manager recover.
    The Transaction Manager helps the Resource Managers recover.
Alternate design: each RM has its own log.
All this depends on rebuilding the log anchor.
[Figure: the Log Anchor and The Log, showing the Transaction Manager Checkpoint Record, the Previous Transaction Manager Checkpoint Record, and the Resource Manager Checkpoint Records.]
10a: 21
Preparing For Restart:
Careful Write of Log Anchor
Use the "standard" careful write techniques:
Put the anchor in a special well-known place(s)
Ping-Pong to 2 or more copies
Timestamp each copy
N-plex the copies on devices with independent failures.
Align copies so that writes are "atomic"
Accept most recent copy on pessimistic reads.
Now TM and RMs can bootstrap:
their anchors are in the log.
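A minimal sketch of the ping-pong, timestamp and pessimistic-read steps for two anchor copies. The anchor_copy layout is an assumption: durable_lsn stands in for the full anchor contents, and a simple XOR check word stands in for whatever a real system would use to recognize a half-written copy.

```c
#include <stddef.h>
#include <stdint.h>

#define ANCHOR_MAGIC 0x4C4F47414E43484FULL

/* Hypothetical on-disc anchor copy. */
typedef struct {
    uint64_t timestamp;     /* 0 means "never written" */
    uint64_t durable_lsn;   /* stand-in for the real anchor contents */
    uint64_t check;         /* detects a half-written copy */
} anchor_copy;

static uint64_t anchor_check(const anchor_copy *c)
{
    return c->timestamp ^ c->durable_lsn ^ ANCHOR_MAGIC;
}

static int anchor_valid(const anchor_copy *c)
{
    return c->timestamp != 0 && c->check == anchor_check(c);
}

/* Pessimistic read: examine both copies and accept the most recent
   valid one. Returns NULL if neither copy is valid. */
const anchor_copy *anchor_read(const anchor_copy copies[2])
{
    const anchor_copy *best = NULL;
    for (int i = 0; i < 2; i++)
        if (anchor_valid(&copies[i]) &&
            (best == NULL || copies[i].timestamp > best->timestamp))
            best = &copies[i];
    return best;
}

/* Ping-pong write: never overwrite the most recent good copy. */
void anchor_write(anchor_copy copies[2], uint64_t now, uint64_t durable_lsn)
{
    const anchor_copy *good = anchor_read(copies);
    anchor_copy *victim = (good == &copies[0]) ? &copies[1] : &copies[0];
    victim->timestamp   = now;
    victim->durable_lsn = durable_lsn;
    victim->check       = anchor_check(victim);
}
```

If a write is interrupted, the damaged copy fails its check and anchor_read falls back to the older copy, which the write never touched.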
10a: 22
Finding the End of the Log
Find the anchor.
If using WADS, go to the WADS area and write the log end.
Else scan forward from the most recent log-anchor LSN:
    Read optimistically all full pages.
    At a half-full page or a bad page, read pessimistically.
Now have the end of the log.
Finish the half-finished record at the end of the log and give it to the TM.
[Figure: two cases for the end of the log: pages ending in a half-finished record, and pages ending at an invalid page.]
10a: 23
Archiving The Log And "Old" Transactions
What if transaction/RM low water mark is 1-month old?
Abort?
Copy aside:
copy the undo/redo log records to a side file
Copy forward:
copy the undo/redo log records forward in the file.
Dynamic log:
copy undo records aside (so can online-undo if needed).
All advance the low water mark.
10a: 24
Archiving the Log Online
[Figure: staggered allocation of log tables 1, 2 and 3 on secondary storage, with the Log and the Archive.]
10a: 25
The Safety Spectrum
Just UNDO
transactional storage (no durable log)
Just Online Restart:
keep simplexed durable log.
Online plus Off-line Archive (no single point of failure):
periodic copies of data
duplex log
Electronic vaulting:
archive copies and duplexing are done to a remote site
via fast communications links (or Federal Express).
10a: 26
Multiple Logs?
Transaction Manager has a log (DECdtm, MS-DTC,…)
Transaction Monitor has a log (CICS, Tuxedo, ACMS,...)
Each DB instance (3 Oracle, 2 Informix, 4 Rdb) has a log.
Some have 3 logs: UNDO, REDO, SNAPSHOT.
Cons:
Lots of tapes/files.
Lots of IOs at commit
Lots of things to break.
Pros:
Portable
Performance (in the 1 RM case)
You decide
10a: 27
Client/Server Logging
One server design (can be process pair)
Well known log server in the net.
Client sends a BATCH of log records to the server.
Gets back an LSN.
Uses "local" LSNs for its objects.
Log servers can be N-plexed processes.
Multi-server design
Client forms a quorum (majority of servers).
Client sends log batch to all, gets back N-LSNs.
If less than majority, client must poll ALL N servers
Servers synchronize their "logical" logs as "sum" of
physical logs (need a majority).
10a: 28
Summary
• Log is a sequential file
• Contains entire history of DB
• Many tricks to write it efficiently and carefully
• Many tricks to archive and recover it
10a: 29