PowerPoint Template - Renmin University of China

Download Report

Transcript PowerPoint Template - Renmin University of China

MVCC on Flash Memory
Company
LOGO
Fan Yulei,
Lab of WAMDM,
School of Information,
Renmin University of China,
Beijing, China,
2009-06-13
Outline
Motivation
MVCC
Berkeley DB
PostgreSQL
Future work
Motivation
Characteristics
 Not In-Place Update
HDD
Flash
Motivation
Kinds of Lock
Log File & Data File
Checkpoint: D & S
1st : Lock
2nd : Release Lock
Multiple Version
2PL
ReadRecovery
Log file  Undo & Redo
Backup Database
Hot-standby : mirrored media
CC
Directed Acycling
Graph
Timestamp Ordering
•2PL
•MVCC
Index : B+-Tree
•Conflict graph
•Timestamp
•Index CC
Read Log file  Undo & Redo
Log
•
Transaction
Transaction
•
Media
•
MVCC
Snapshot Isolation
MVCC
 Monoversion Schedule
Conflict cycle: t1,t2
 s = r1(x) w1(x) r2(x) w2(y) r1(y) w1(z) c1 c2
 s’ = r1(x) w1(x) r2(x) r1(y) w2(y) w1(z) c1 c2
 Multiversion Schedule & Monoversion Schedule
 Multiversion Schedule
• m = r1(x0) w1(x1) r2(x1) w2(y2) r1(y0) w1(z1) c1 c2
• h(ri(x))=wj(x) & h(wi(x))=wi(x): version function
 Monoversion Schedule
• m = r1(x0) w1(x1) r2(x1) w2(y2) r1(y2) w1(z1) c1 c2
• s = r1(x) w1(x) r2(x) w2(y) r1(y) w1(z) c1 c2
• Monoversion Schedule is a special case of Multiversion Schedule
MVCC
 Traditional Conflict
 s = w0(x) c0 w1(x) c1 r2(x) w2(y) c2
 m = w0(x0) c0 w1(x1) c1 r2(x0) w2(y2) c2
 View Equivalent
 Reads-From Relationship
• RF(m) := {(ti, x, tj) | rj(xi) ∈OP(m) & ti, tj ∈trans(m)}
 View Equivalent
• trans(m) = trans(m’) and RF(m) = RF(m’)
 Example
• m = w0(x0) w0 (y0) c0 r3(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(y2) c2
• m’ = w0(x0) w0 (y0) c0 w1(x1) c1 r2(x1) r3(x0) w2(y2) w3(x3) c3 c2
MVCC
 Multiversion View Serializability
 Serializable but not View Equivalent
• m = w0(x0) w0(y0)c0 r1(x0) r1(y0) w1(x1) w1(y1)c1 r2(x0) r2(y1)c2
• s = w0(x) w0(y)c0 r1(x) r1(y) w1(x) w1(y)c1 r2(x) r2(y)c2
 MVSR
• m’ is a serialized monoversion schedule
• trans(m) = trans(m’)
• m and m’ are view equivalent
 Example
• m = w0(x0) w0 (y0) c0 w1(x1) c1 r2(x1) r3(x0) w3(x3) c3 w2(y2) c2
• m’ = w0(x0) w0 (y0) c0 r3(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(y2) c2
• s = w0(x) w0 (y) c0 r3(x) w3(x) c3 w1(x) c1 r2(x) w2(y) c2
MVCC
 Conflict Graph G(m) = (V , E)
r2(x0)r2(y1)
r2(x1)r2(y0)
 V = trans(m) ;
 E = {(ti, tj) | rj(xi) ∈OP(m) & ti, tj ∈trans(m)}}
 m and m’ are View Equivalent => G(m) = G(m’)
 Version Oder
 m = w0(x0) w0 (y0) w0 (z0) c0 r1(x0) r2(x0) r2(z0) r3(z0) w1(y1) w2(x2)
w3(y3) w3(z3) c1c2c3 r4(x2) r4(y3) r4(z3) c4
T1
 Version Oder = {x0«x2, y0«y1«y3, z0«z3}
 MVSG
T3
T0



 M
MVSG = G(m) + Version Order
rk(xj) and wi(xi), k≠i≠j
If xi « xj then (ti, tj) ∈E; else (tk, ti) ∈E
∈ MVSR iff MVSG(m, «) have no cycle
T2
T4
MVCC
 Multiversion Conflict
 ri(xj) and wk(xk) and ri(xj) < wk(xk)
 Multiversion Conflict Serializability
 m’ is a serialized monoversion schedule
 trans(m) = trans(m’)
 Pair of operations with conflict: same ordering
 Multiversion Conflict Graph
all
 E={(ti, tk) | ri(xj) < wk(xk) }
MVSR
MCSR
 M ∈ MVCR iff MSVG(m, «) have
VSR no cycle
CSR
MVCC
 Limit the number of version: k=2







w0(x0) c0 r1(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(x2) c2
w0(x0) c0 r1(x0) w1(x1) c1 r2(x1) w2(x2) c2 w3(x3) c3
w0(x0) c0 r1(x0) w1(x1) c1 w3(x3) c3 r2(x3) w2(x2) c2
w0(x0) c0 r2(x0) w2(x2) c2 r1(x2) w1(x1) c1 w3(x3) c3
w0(x0) c0 r2(x0) w2(x2) c2 w3(x3) c3 r1(x3) w1(x1) c1
w0(x0) c0 w3(x3) c3 r1(x3) w1(x1) c1 r2(x1) w2(x2) c2
w0(x0) c0 w3(y3) c3 r2(x3) w2(x2) c2 r1(x2) w1(x1) c1
 K-version view serializability (kVSR):
 Serializable
 View equivalent
 k newest/nearest version
 Hierarchy Relationship
VSR  1VSR  2VSR  3VSR    MVSR
x1,x2
x2,x3
x2,x3
x1,x3
x1,x3
x1,x2
x1,x2
MVCC
 MVCC Protocol
 MVTO (multiversion timestamp ordering)
 MV2PL : 2VPL
• three kinds of kinds: rl, wl, cl
 MVSGT
 ROMV
• Read-only transaction
Berkeley DB
 Five components
a standalone utility
 Deadlock detection
one or more library interfaces
• db_deadlock
• DB_ENV->lock_detect, DB_ENV->set_lk_detect
 Checkpoints
• db_checkpoint
• DB_ENV->txn_checkpoint
 Database and log file archival
• db_archive
• DB_ENV->log_archive
 Log file removal
• db_archive
• DB_ENV->log_archive
 Recovery procedures
• db_recover
• DB_ENV->open
Berkeley DB
 Transaction API
 Transaction Subsystem and Related Methods Description
• DB_ENV->txn_checkpoint,
DB_ENV->txn_stat
• DB_ENV->open
DB_ENV->remove
DB_ENV->txn_recover
DB_ENV->close
 Transaction Subsystem Configuration
• DB_ENV->set_timeout
DB_ENV->set_tx_timestamp
DB_ENV->set_tx_max
 Transaction Operations
• DB_ENV->txn_begin
DB_TXN->commit
DB_TXN->id
DB_TXN->set_name
DB_TXN->abort
DB_TXN->discard
DB_TXN->prepare
DB_TXN->set_timeout
Berkeley DB
 2PL In Berkeley DB
 Locks are released
• during DB_TXN->abort or DB_TXN->commit.
 Guidelines:
• If possible, use nested transactions to protect the parts of
your transaction most likely to deadlock
 Transaction limits
 Transaction IDs: 31-bit unsigned integer (OX80000000)
 Cursors: can not span more transactions, must be opened
and closed within a single transaction
 Multiple Threads of Control:
Berkeley DB
 Several filesystem operations on Berkeley DB
 Disk seek to database file,
Database file read,
Disk seek to log file,
Log file write,
Disk seek to update log file metadata, Log metadata write,
Flush log file information to disk,
Flush log file metadata to disk
 Ways to increase transactional throughput
 Berkeley DB software support group commit
 Additional tuning parameters
• Tune the size of the database cache
• Put the database and the log files on different disks
• Set the filesystem configuration
• Upgrade your hardware
• Turn on DB_TXN_WRITE_NOSYNC or DB_TXN_NOSYNC flags
– ACI, but not D
PostgreSQL
 PG: a sanpshot of data
 Reading never blocks writing
 Writing never blocks reading
 Three undesirable phenomena
 dirty reads, non-repeatable reads, phantom read
 SQL Transaction Isolation Levels
Isolation Level
Dirty Read
Non-Repeatable Read
Phantom Read
Read uncommitted
Possible
Possible
Possible
Read committed
Not possible
Possible
Possible
Repeatable read
Not possible
Not possible
Possible
Serializable
Not possible
Not possible
Not possible
PostgreSQL
 Read Committed Isolation Level
 the default isolation level
 A SELECT query sees only data committed
 The SELECT does see the effects of previous updates
executed within this same transaction
 Two successive SELECTs can see different data
• Other transactions commit changes during executions
 NOT adequate for many applications that do complex
queries and updates
 Serializable Isolation Level
 This level emulates serial transaction execution.
PostgreSQL
 Data consistency checks at the application level
 Readers in PostgreSQL don't lock data
 To ensure the current existence of a row and protect it
against concurrent updates one must use SELECT FOR
UPDATE or an appropriate LOCK TABLE statement.
(SELECT FOR UPDATE locks just the returned rows
against concurrent updates, while LOCK TABLE protects
the whole table.)
 Lock and Tables
 Table-level Lock
 Row-level : when rows are being updated
 Lock and Index
 Gist and R-tree : released after statement is done
 Hash Index : released after page is processed
 B-Tree : released immediately after each index tuple is
fetched/inserted
ASL
RSL
REL
SUEL
SL
SREL
EL
AEL
AccessShareLock
√
√
√
√
√
√
√
×
RowShareLock
√
√
√
√
√
√
×
×
RowExclusiveLock
√
√
√
√
×
×
×
×
ShareUpdateExclusiveLock
√
√
√
×
×
×
×
×
ShareLock
√
√
×
×
√
×
×
×
ShareRowExclusiveLock
√
√
×
×
×
×
×
×
ExclusiveLock
√
×
×
×
×
×
×
×
AccessExclusiveLock
×
×
×
×
×
×
×
×
SR
DR
IR
UR
AT
DT
CI
LT
√
√
√
√
√
√
√
√
AccessShareLock
RowShareLock
RowExclusiveLock
√
√
√
√
√
√
√
ShareUpdateExclusiveLock
√
ShareLock
√
ShareRowExclusiveLock
√
ExclusiveLock
√
AccessExclusiveLock
√
√
√
Future work
Experiment
BDB & PG Code
Transaction on Flash Memory
 Concurrency Control
• MVCC
 Recovery
• Log
Company
LOGO